The dataset we will be working with represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

  1. It is an inpatient encounter (a hospital admission).
  2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
  3. The length of stay was at least 1 day and at most 14 days.
  4. Laboratory tests were performed during the encounter.
  5. Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.

The dataset is avaliable from the UCI Machine Learning Repository.

The project is a 2-way classification task in which we try to predict if a patient will either

  1. Not be readmitted
  2. Readmitted in <30 days

Read in the dataset

# Reading in data
data <- read.csv("dataset/diabetic_data.csv")

# remove duplicates
data <- data[!duplicated(data$patient_nbr), ]

# summary of the data 
str(data)
## 'data.frame':    71518 obs. of  50 variables:
##  $ encounter_id            : int  2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ race                    : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
##  $ gender                  : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
##  $ age                     : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weight                  : Factor w/ 10 levels "?","[0-25)","[100-125)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ admission_type_id       : int  6 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  25 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : Factor w/ 18 levels "?","BC","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ medical_specialty       : Factor w/ 73 levels "?","AllergyandImmunology",..: 39 1 1 1 1 1 1 1 1 20 ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : Factor w/ 717 levels "?","10","11",..: 126 145 456 556 56 265 265 278 254 284 ...
##  $ diag_2                  : Factor w/ 749 levels "?","11","110",..: 1 81 80 99 26 248 248 316 262 48 ...
##  $ diag_3                  : Factor w/ 790 levels "?","11","110",..: 1 123 768 250 88 88 772 88 231 319 ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
##  $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
##  $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ examide                 : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citoglipton             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
##  $ glyburide.metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glipizide.metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.pioglitazone  : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...

Data Preparation and Preprocessing

Before moving on to do some data cleaning, let’s observe the descriptive statistics of each column in the training dataset.

The skimr package provides a nice solution to show key descriptive stats for each column.

data[data == "?"] <- NA
skimmed <- skim_to_wide(data)
knitr::kable(skimmed[, c(1:3, 5:6, 9:10, 16)])
type variable missing n n_unique mean sd hist
factor A1Cresult 0 71518 4 NA NA NA
factor acarbose 0 71518 3 NA NA NA
factor acetohexamide 0 71518 2 NA NA NA
factor age 0 71518 10 NA NA NA
factor change 0 71518 2 NA NA NA
factor chlorpropamide 0 71518 4 NA NA NA
factor citoglipton 0 71518 1 NA NA NA
factor diabetesMed 0 71518 2 NA NA NA
factor diag_1 11 71518 696 NA NA NA
factor diag_2 294 71518 725 NA NA NA
factor diag_3 1225 71518 758 NA NA NA
factor examide 0 71518 1 NA NA NA
factor gender 0 71518 3 NA NA NA
factor glimepiride 0 71518 4 NA NA NA
factor glimepiride.pioglitazone 0 71518 1 NA NA NA
factor glipizide 0 71518 4 NA NA NA
factor glipizide.metformin 0 71518 2 NA NA NA
factor glyburide 0 71518 4 NA NA NA
factor glyburide.metformin 0 71518 4 NA NA NA
factor insulin 0 71518 4 NA NA NA
factor max_glu_serum 0 71518 4 NA NA NA
factor medical_specialty 34477 71518 70 NA NA NA
factor metformin 0 71518 4 NA NA NA
factor metformin.pioglitazone 0 71518 2 NA NA NA
factor metformin.rosiglitazone 0 71518 2 NA NA NA
factor miglitol 0 71518 4 NA NA NA
factor nateglinide 0 71518 4 NA NA NA
factor payer_code 31043 71518 17 NA NA NA
factor pioglitazone 0 71518 4 NA NA NA
factor race 1948 71518 5 NA NA NA
factor readmitted 0 71518 3 NA NA NA
factor repaglinide 0 71518 4 NA NA NA
factor rosiglitazone 0 71518 4 NA NA NA
factor tolazamide 0 71518 2 NA NA NA
factor tolbutamide 0 71518 2 NA NA NA
factor troglitazone 0 71518 2 NA NA NA
factor weight 68665 71518 9 NA NA NA
integer admission_source_id 0 71518 NA 5.66 4.16 ▅▇▁▁▁▁▁▁
integer admission_type_id 0 71518 NA 2.1 1.51 ▇▃▃▁▁▁▁▁
integer discharge_disposition_id 0 71518 NA 3.59 5.27 ▇▂▁▁▁▁▁▁
integer encounter_id 0 71518 NA 1.6e+08 1e+08 ▅▇▇▅▃▂▁▁
integer num_lab_procedures 0 71518 NA 43.08 19.95 ▃▃▇▆▂▁▁▁
integer num_medications 0 71518 NA 15.71 8.31 ▆▇▂▁▁▁▁▁
integer num_procedures 0 71518 NA 1.43 1.76 ▇▃▂▂▁▁▁▁
integer number_diagnoses 0 71518 NA 7.25 1.99 ▁▂▅▃▇▁▁▁
integer number_emergency 0 71518 NA 0.1 0.51 ▇▁▁▁▁▁▁▁
integer number_inpatient 0 71518 NA 0.18 0.6 ▇▁▁▁▁▁▁▁
integer number_outpatient 0 71518 NA 0.28 1.07 ▇▁▁▁▁▁▁▁
integer patient_nbr 0 71518 NA 5.5e+07 3.9e+07 ▇▇▅▆▅▁▁▁
integer time_in_hospital 0 71518 NA 4.29 2.95 ▇▇▂▃▂▁▁▁

Now, we will drop some of the columns for various reasons: - Encounter ID: an uneccessary column that provides no information for readmission - Patient Number: an unecessary column that provides no information for readmission - Weight: While potentially useful, there is too many missing values - Payer Code: too many missing values - Medical Specialty: probably unecessary and too many missing values - Diabetes Medication: redundant information - diag_2 and diag_3: ICD codes are hard to deal with and often lack predictive power

drops <- c("encounter_id", "patient_nbr", "weight", "payer_code", 
           "medical_specialty", "diabetesMed", "diag_2", "diag_3")
data <- data[ , !(names(data) %in% drops)]

Next, we need to drop any patients who’s discharge status is listed as “expired”. If a patient has died, there is no chance of readmission.

data <- filter(data, !(discharge_disposition_id %in% c(11, 19, 20, 21)))

We will also remove the 3 subjects with an unknon gender.

data <- data %>% filter(gender != "Unknown/Invalid")
data$gender <- droplevels(data$gender)
Data Encodings
  • Gender will be encoded as 0 for Female, 1 for Male
data$gender <- as.integer(data$gender) - 1
  • Age is given as 10 year intervals. We will split this interval into the median age of the interval.
# Encoding ages
data$age <- ifelse(data$age == "[0-10)", 5, 
                  ifelse(data$age == "[10-20)", 15, 
                         ifelse(data$age == "[20-30)", 25, 
                                ifelse(data$age == "[30-40)", 35, 
                                       ifelse(data$age == "[40-50)", 45, 
                                              ifelse(data$age == "[50-60)", 55, 
                                                     ifelse(data$age == "[60-70)", 65, 
                                                            ifelse(data$age == "[70-80)", 75, 
                                                                   ifelse(data$age == "[80-90)", 85, 95)))))))))
  • Glucose Serum will be encoded as follows:
    • 0 for No measurment
    • 1 for Normal reading
    • 2 for >200 & >300 (abnormal readings)
data$max_glu_serum <- ifelse(data$max_glu_serum == "None", 0, 
                          ifelse(data$max_glu_serum == "Norm", 1, 2))
  • A1C test results will be encoded as follows:
    • 0 for No measurement
    • 1 for Normal reading
    • 2 for an abnormal reading (>7 or >8)
data$A1Cresult <- ifelse(data$A1Cresult == "None", 0, 
                         ifelse(data$A1Cresult == "Norm", 1, 
                                ifelse(data$A1Cresult %in% c(">7", ">8"), 2, NA)))
  • Changes in insulin doage will be encoded as follows:
    • 0 for no insulin
    • 1 for decrease in insulin
    • 2 for steady insulin
    • 3 for increase in insulin
data$insulin <- ifelse(data$insulin == "No", 0, ifelse(data$insulin == "Down", 1, ifelse(data$insulin == "Steady", 2, 3)))
  • Change in Medincations will be encoded as follows:
    • 0 for no change in medication
    • 1 for change in medication
data$change <- ifelse(data$change == "Ch", 1, 0)
  • Race will be encoded as follows:
    • 0 for Caucasian
    • 1 for African American
    • 2 for Other
data$race <- ifelse(data$race == "Caucasian", 0, ifelse(data$race == "AfricanAmerican", 1, 2))
data$race[is.na(data$race)] <- 2
  • Readmitted Status will be encoded as follows:
    • 0 for No Readmission within 30 days
    • 1 for a Readmission in <30 days
data$readmitted <- ifelse((data$readmitted == "NO" | data$readmitted == ">30"), 0, 1)
data %>% group_by(readmitted) %>% summarize(count=n())
## # A tibble: 2 x 2
##   readmitted count
##        <dbl> <int>
## 1          0 64138
## 2          1  6293
New Features
  • Healthcare Utilization is calculated as the sum of number of outpatient, emergency, and inpatient encouters the patient has had in the past year
# Sum of number of outpatient, emergency and inpatient encounters 
data$healthcare_utilization <- (data$number_outpatient + data$number_emergency + data$number_inpatient)
  • Sum on non-insulin diabetes medications
# Creating sum of other meds on column
# (not all of these columns have all four options, but it was easier to code this way - gives same result)
data$metformin <- ifelse(data$metformin %in% c("Up", "Steady", "Down"),1,0)
data$repaglinide <- ifelse(data$repaglinide %in% c("Up", "Steady", "Down"),1,0)
data$nateglinide <- ifelse(data$nateglinide %in% c("Up", "Steady", "Down"),1,0)
data$chlorpropamide <- ifelse(data$chlorpropamide %in% c("Up", "Steady", "Down"),1,0)
data$glimepiride <- ifelse(data$glimepiride %in% c("Up", "Steady", "Down"),1,0)
data$acetohexamide <- ifelse(data$acetohexamide %in% c("Up", "Steady", "Down"),1,0)
data$glipizide <- ifelse(data$glipizide %in% c("Up", "Steady", "Down"),1,0)
data$glyburide <- ifelse(data$glyburide %in% c("Up", "Steady", "Down"),1,0)
data$tolbutamide <- ifelse(data$tolbutamide %in% c("Up", "Steady", "Down"),1,0)
data$pioglitazone <- ifelse(data$pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$rosiglitazone <- ifelse(data$rosiglitazone %in% c("Up", "Steady", "Down"),1,0)
data$acarbose <- ifelse(data$acarbose %in% c("Up", "Steady", "Down"),1,0)
data$miglitol <- ifelse(data$miglitol %in% c("Up", "Steady", "Down"),1,0)
data$troglitazone <- ifelse(data$troglitazone %in% c("Up", "Steady", "Down"),1,0)
data$tolazamide <- ifelse(data$tolazamide %in% c("Up", "Steady", "Down"),1,0)
data$examide <- ifelse(data$examide %in% c("Up", "Steady", "Down"),1,0)
data$citoglipton <- ifelse(data$citoglipton %in% c("Up", "Steady", "Down"),1,0)
data$glyburide.metformin <- ifelse(data$glyburide.metformin %in% c("Up", "Steady", "Down"),1,0)
data$glipizide.metformin <- ifelse(data$glipizide.metformin %in% c("Up", "Steady", "Down"),1,0)
data$glimepiride.pioglitazone <- ifelse(data$glimepiride.pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$metformin.rosiglitazone <- ifelse(data$metformin.rosiglitazone %in% c("Up", "Steady", "Down"),1,0)
data$metformin.pioglitazone <- ifelse(data$metformin.pioglitazone %in% c("Up", "Steady", "Down"),1,0)

data$num_other_diabetes_meds_up_stdy_dwn <- (data$metformin + data$repaglinide + data$nateglinide + data$chlorpropamide + data$glimepiride + data$acetohexamide + data$glipizide + data$glyburide + data$tolbutamide + data$pioglitazone + data$rosiglitazone + data$acarbose + data$miglitol + data$troglitazone + data$tolazamide + data$examide + data$citoglipton + data$glyburide.metformin + data$glipizide.metformin + data$glimepiride.pioglitazone + data$metformin.rosiglitazone + data$metformin.pioglitazone)


drops2 <- c("metformin", "repaglinide", "nateglinide", "chlorpropamide", "glimepiride", "acetohexamide", "glipizide", 
            "glyburide", "tolbutamide", "pioglitazone", "rosiglitazone", "acarbose", "miglitol", "troglitazone", 
            "tolazamide", "examide", "citoglipton", "glyburide.metformin", "glipizide.metformin", "glimepiride.pioglitazone",
            "metformin.rosiglitazone", "metformin.pioglitazone")

data <- data[ , !(names(data) %in% drops2)]
One Hot Encodings
  • Admission types, discharge dispositions, admission source will be grouped and one hot encoded
# set up dummy variables for admission types
data <- mutate(data, at_emergent = as.numeric(admission_type_id %in% c(1, 2, 7)), 
              at_elective = as.numeric(admission_type_id == 3), 
              at_other = as.numeric(admission_type_id %in% c(4, 5, 6, 8)))

# Set up dummy variables for discharge dispositions
data <- mutate(data, dd_home = as.numeric(discharge_disposition_id %in% c(1, 6, 8)), 
              dd_facility_transfer = as.numeric(discharge_disposition_id %in% c(2, 3, 4, 5, 10, 16, 17, 22, 23, 24, 30, 27, 28, 29, 13, 14)), 
              dd_other = as.numeric(discharge_disposition_id %in% c(7, 18, 25, 26, 9, 12, 15))) 
              # dd_admitted = as.numeric(discharge_disposition_id %in% c(9, 12, 15)), #lumping admitted into other
              # add_expired = as.numeric(discharge_disposition_id %in% c(11, 19, 20, 21)), (dropped patients who expired)
              # dd_hospice = as.numeric(discharge_disposition_id %in% c(13, 14))) #lumping hospice into transfer

# Set up dummy variables for admission source
data <- mutate(data, as_outpatient = as.numeric(admission_source_id %in% c(1, 2, 3)),
              as_facility_transfer = as.numeric(admission_source_id %in% c(4, 5, 6, 10, 18, 19, 22, 25, 26)),
              as_ed = as.numeric(admission_source_id %in% c(7)),
              as_other = as.numeric(admission_source_id %in% c(8, 9, 15, 17, 20, 21, 11, 12, 13, 14, 23, 24)))
              #as_newborn = as.numeric(admission_source_id %in% c(11, 12, 13, 14, 23, 24)))#as_newborn added to as_other

data <- within(data, rm(admission_type_id))
data <- within(data, rm(discharge_disposition_id))
data <- within(data, rm(admission_source_id))
  • ICD codes have over 700 different “levels” in our dataframe. We will group these ICD codes into 20 different categories based on the Strack et al. 2014 paper. We will then one hot encode these categories for each patient.
data$diag_1 <- as.character(data$diag_1)

data$diag_1grp <- ifelse (data$diag_1 == "?", "Unknown", 
          ifelse(grepl(data$diag_1, pattern = "[EV]") == T, "External", 
            ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory", 
              ifelse(data$diag_1 %in% c(390:459, 785), "Diabetes", 
                ifelse(data$diag_1 %in% c(460:519, 786), "Respiratory", 
                  ifelse(data$diag_1 %in% c(520:579, 787), "Digestive", 
                    ifelse(data$diag_1 %in% c(800:999), "Injury", 
                      ifelse(data$diag_1 %in% c(710:739), "Musculoskeletal",
                        ifelse(data$diag_1 %in% c(580:629, 788), "Genitourinary", 
                         ifelse(data$diag_1 %in% c(140:239), "Neoplasm", 
                           ifelse(data$diag_1 %in% c(780, 781, 784, 790:799, 740:759), "Other", # added congenital to other
                             ifelse(data$diag_1 %in% c(240:249, 251:279), "Endocrine_Nutrition_Metabolic",  
                               ifelse(data$diag_1 %in% c(680:709, 782), "Skin", 
                                 ifelse(data$diag_1 %in% c(001:139), "Infectious",
                                   ifelse(data$diag_1 %in% c(290:319), "Mental", 
                                     ifelse(data$diag_1 %in% c(280:289), "Blood",
                                       ifelse(data$diag_1 %in% c(320:359), "Nervous",
                                         ifelse(data$diag_1 %in% c(630:679), "Pregnancy", 
                                          ifelse(data$diag_1 %in% c(360:389), "Sense", "Unknown") 
                                            #ifelse(data$diag_1 %in% c(740:759), "Congenital", "NULL")
                                ))))))))))))))))))
## Warning in ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory", :
## NAs introduced by coercion
data$diag_1grp <- as.factor(data$diag_1grp)

data <- one_hot(data.table(data))
data <- as.data.frame(data) # convert back to data.frame format

data <- within(data, rm(diag_1))
data <- na.omit(data)
data <- data %>% select(-readmitted, readmitted) # move readmitted to the last column value
cor_mat <- cor(data[, 1:46])
cor_mat[!lower.tri(cor_mat)] <- 0
data <- data[,!apply(cor_mat,2,function(x) any(abs(x) > 0.6))]

cor_mat2 <- cor(data[, 1:41])

get_lower_tri<-function(cormat){
    cormat[upper.tri(cormat)] <- NA
    return(cormat)
}

cor_mat_melted <- melt(get_lower_tri(cor_mat2), na.rm = T)

ggplot(data = cor_mat_melted, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
        legend.position = c(0,1), legend.justification = c(0,1),
        axis.title = element_blank(), plot.title = element_text(hjust = 0.5)) +
  labs(title = "Correlation matrix") +
  scale_y_discrete(position = "right")

#corrplot(cor_mat2, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

Train Test Split

The dataset is now ready, and we will split it into training(80%) and test(20%) datasets using caret’s createDataPartition function.

# write cleaned dataset
write.csv(data, file = "dataset/clean_diabetic_data.csv", row.names = F)

# Create the training and test datasets
set.seed(123)

# Step 1: Get row numbers for the training data (80% train split, 20% testing split)
trainRowNumbers <- createDataPartition(data$readmitted, p=0.8, list=FALSE)

# Step 2: Create the training  dataset
trainData <- data[trainRowNumbers,]

# Step 3: Create the test dataset
testData <- data[-trainRowNumbers,]

# Step 4: Save the datasets 
write.csv(trainData, file = "dataset/trainData.csv", row.names = F)
write.csv(testData, file = "dataset/testData.csv", row.names = F)

Exploratory Data Analysis

PCA and t-SNE

pca_tsne_df <- data[,c(1:41)]
# scale variables
for(i in 1:ncol(pca_tsne_df)){
  pca_tsne_df[[i]] <- ((pca_tsne_df[[i]] - mean(pca_tsne_df[[i]])) / sd(pca_tsne_df[[i]]))
}

pca_tsne_df <- t(pca_tsne_df)

#PCA

PC <- prcomp(pca_tsne_df)

var_explained <- data.frame("PC" = c(1:41), "eigenvalue" = (PC$sdev)^2)
var_explained$var_explained <- var_explained$eigenvalue / sum(var_explained$eigenvalue)
var_explained$cumulative <- cumsum(var_explained$var_explained) / sum(var_explained$var_explained)

ggplot(var_explained, aes(x = PC, y = var_explained)) + 
  geom_line() +
  labs(title = "Elbow plot: proportion of variance explained by each PC")

ggplot(var_explained, aes(x = PC, y = cumulative)) + 
  geom_line() +
  labs(title = "Cumulative variance explained by PC_1 through PC_n",
       x = "n", y = "vat_explained")

outdf <- cbind(data, PC$rotation)
outdf$readmitted <- as.factor(outdf$readmitted)

ggplot(outdf, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = readmitted), alpha = 0.5) +
  scale_color_manual(values = c("cadetblue3", "blue")) +
  labs(title = "Classes are not well-separated by principal components")

# TSNE
tsne_model_1 = Rtsne(t(as.matrix(pca_tsne_df)), check_duplicates=FALSE, pca=TRUE,perplexity = 10, theta=0.5, dims=2)

## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y) 

d_tsne_1$readmitted <- as.factor(data$readmitted)

## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +  
  geom_point(aes(color = readmitted)) +
  scale_color_manual(values = c("cadetblue3", "blue")) +
  labs(title = "Classes are not well-separated by t-SNE")

We do not see separation of readmitted classes by PCA or t-SNE analysis.

Composition Graphs

ggplot(data=data, aes(x=factor(race, labels = c("Caucasian", "African American", "Other")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Race", x="Race", y="Count", fill="Readmission Status") 

ggplot(data=data, aes(x=factor(gender, labels = c("Female", "Male")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Gender", x="Gender", y="Count", fill="Readmission Status") 

ggplot(data=data, aes(x=factor(A1Cresult, labels = c("No Measurement", "Normal Reading", "Abnormal")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of A1C Test Result", x="A1C", y="Count", fill="Readmission Status") 

ggplot(data=data, aes(x=factor(insulin, labels = c("No Insulin", "Decrease in Insulin", "Steady Insulin", "Increase in Insulin")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Insulin Change", x="Insulin Change", y="Count", fill="Readmission Status")

ggplot(data=data, aes(x=factor(change, labels = c("No Change", "Change in Medication")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Medication Change", x="Medication Change", y="Count", fill="Readmission Status")

Distribution Graphs

ggplot(data=data, aes(x=time_in_hospital, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="Time in Hospital", y="Density", 
       title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=time_in_hospital)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Time in Hospital", x="Readmission Status", 
       title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_lab_procedures, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Lab Proedures", y="Density", 
       title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_lab_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Lab Procedures", x="Readmission Status", 
       title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_procedures, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Non-Lab Proedures", y="Density", 
       title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Non-Lab Procedures", x="Readmission Status", 
       title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_medications, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Medications Administered", y="Density", 
       title="Distribution of # of Medications Administered Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Medications Administered", x="Readmission Status", 
       title="Distribution of # Medications Administered Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=number_diagnoses, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Diagnoses", y="Density", 
       title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=number_diagnoses)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Diagnoses", x="Readmission Status", 
       title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=age, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="End of Age Interval", y="Density", 
       title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=age)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="End of Age Interval", x="Readmission Status", 
       title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=healthcare_utilization, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="Healtchare Utilization", y="Density", 
       title="Distribution of Healthcare Utilization by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=healthcare_utilization)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Healthcare Utilization", x="Readmission Status", 
       title="Distribution of Healtchare Utilization Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_other_diabetes_meds_up_stdy_dwn, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Diabetes Medications Changed", y="Density", 
       title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_other_diabetes_meds_up_stdy_dwn)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Diabetes Medications Changed", x="Readmission Status", 
       title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")

SMOTE Algorithm For Unbalanced Classification Problems

“SMOTE is an oversampling method which creates new instances of the minority class by forming convex combinations of neighboring instances. It effectively draws lines between minority points in the feature space, and samples along these lines. This allows us to balance our data-set without as much overfitting, as we create new synthetic examples rather than using duplicates. This however does not prevent all overfitting, as these are still created from existing data points.” (https://towardsdatascience.com/dealing-with-imbalanced-classes-in-machine-learning-d43d6fa19d2)

if(file.exists("dataset/trainSMOTE.csv")) {
  trainData_SMOTE <- read.csv("dataset/trainSMOTE.csv")
} else {
  trainData$readmitted <- as.factor(trainData$readmitted)
  trainData_SMOTE <- SMOTE(readmitted ~ ., trainData) 
  write.csv(trainData_SMOTE, "dataset/trainSMOTE.csv", row.names = F)
}

XGBoost

XGBoost stands for eXtreme Gradient Boosting. You can read more about it here.

# create XGB Boost dense matrix objects --> faster performance 
dtrain <- xgb.DMatrix(data = as.matrix(trainData_SMOTE[, 1:41]), label = as.numeric(trainData_SMOTE$readmitted))
dtest <- xgb.DMatrix(data = as.matrix(testData[, 1:41]), label = as.numeric(testData$readmitted))

Hyperparameter optimization

In machine learning, hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

if(file.exists("xgb_params.csv")) {
    best_params <- read.csv("xgb_params.csv")
} else {
  param_grid <- expand.grid(objective = "binary:logistic",
                          eval_metric = "auc", 
                          eta = c(0.1, 0.3, 0.5), 
                          max_depth = c(2,4,6),
                          gamma = c(2,3,4), 
                          scale_pos_weight = c(2, 3))

  best_test_auc <- 0
  best_params <- param_grid[1, ]
   
  for (i in 1:nrow(param_grid)) {
    print(i)
    xgb_cv <- xgb.cv(params = param_grid[i, ], data = dtrain, nrounds = 150, 
                     nfold = 10, early_stopping_rounds = 5, verbose = F)
    if(xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean] > best_test_auc) {
      best_test_auc = xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean] 
      best_params <- param_grid[i, ]
    } 
  }
  write.csv(best_params, "xgb_params.csv", row.names = F)
}

Training the Model

xgb <- xgb.train(params = best_params, data = dtrain, nrounds = 500)
xgbPredict <- predict(xgb, dtest)
xgbPredict <- as.numeric(xgbPredict> 0.5)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(xgbPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9761  753
##          1 3033  537
##                                           
##                Accuracy : 0.7312          
##                  95% CI : (0.7238, 0.7385)
##     No Information Rate : 0.9084          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0999          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.41628         
##             Specificity : 0.76294         
##          Pos Pred Value : 0.15042         
##          Neg Pred Value : 0.92838         
##               Precision : 0.15042         
##                  Recall : 0.41628         
##                      F1 : 0.22099         
##              Prevalence : 0.09159         
##          Detection Rate : 0.03813         
##    Detection Prevalence : 0.25348         
##       Balanced Accuracy : 0.58961         
##                                           
##        'Positive' Class : 1               
## 

Variable Importance According to XGBoost

importance <- xgb.importance(model = xgb)
xgb.ggplot.importance(importance_matrix = importance, top_n = 10)

Simple ROC Curve

ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s. You can read more about it here.

pred <- prediction(xgbPredict, testData$readmitted)
AUC <- performance(pred, "auc")@y.values[[1]]
perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(AUC, digits=3, scientific=FALSE)))

KNN

ctrl <- trainControl(method="cv", number = 5)
knn <- train(as.factor(readmitted) ~ ., data = trainData_SMOTE, method = "knn", metric = "kappa",
             trControl = ctrl, preProcess = c("center","scale"), tuneLength = 10)
## Warning in train.default(x, y, weights = w, ...): The metric "kappa" was
## not in the result set. Accuracy will be used instead.
knn
## k-Nearest Neighbors 
## 
## 35007 samples
##    41 predictor
##     2 classes: '0', '1' 
## 
## Pre-processing: centered (41), scaled (41) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 28005, 28006, 28006, 28006, 28005 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7256547  0.4574105
##    7  0.7103434  0.4251942
##    9  0.6953464  0.3938713
##   11  0.6867195  0.3750039
##   13  0.6782070  0.3566975
##   15  0.6700658  0.3392946
##   17  0.6662666  0.3305906
##   19  0.6632959  0.3237004
##   21  0.6580968  0.3121660
##   23  0.6524122  0.2997506
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn)

knnPredict <- predict(knn, testData)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(knnPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8111  704
##          1 4683  586
##                                           
##                Accuracy : 0.6175          
##                  95% CI : (0.6094, 0.6255)
##     No Information Rate : 0.9084          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.037           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.45426         
##             Specificity : 0.63397         
##          Pos Pred Value : 0.11122         
##          Neg Pred Value : 0.92014         
##               Precision : 0.11122         
##                  Recall : 0.45426         
##                      F1 : 0.17869         
##              Prevalence : 0.09159         
##          Detection Rate : 0.04161         
##    Detection Prevalence : 0.37411         
##       Balanced Accuracy : 0.54412         
##                                           
##        'Positive' Class : 1               
## 
knnPred <- prediction(as.numeric(knnPredict), testData$readmitted)
knn_AUC <- performance(knnPred, "auc")@y.values[[1]]
knn_perf_ROC=performance(knnPred,"tpr","fpr") #plot the actual ROC curve
plot(knn_perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(knn_AUC, digits=3, scientific=FALSE)))